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A  Robust  Parallel  Iterative  Solver  for  Sparse  Nonsymmetric  Linear  Systems 


A.  Introduction 

Block  tridiagonal  linear  systems  occur  frequently  in  scientific  and  engineering  computations,  often 
from  a  linearisation  and  discretisation  of  a  nonlinear  differential  operator.  In  this  setting  three  features 
are  common:  many  such  systems  must  be  solved,  each  system  is  large-scale,  and  the  coefficient  matrix  is 
possibly  nonsymmetric.  The  first  feature  implies  that  an  iterative  scheme  is  desirable  since  it  allows  the 
linear  systems  to  be  solved  only  to  the  required  accuracy,  which  needs  to  be  high  only  for  the  final  few  sys¬ 
tems.  The  second  feature  implies  that  an  iterative  scheme  may  be  necessary,  since  factorisation  of  a  large 
linear  system  may  exceed  the  available  computer  memory.  The  third  feature  rules  out  several  highly  suc¬ 
cessful  algorithms  for  symmetric  positive  definite  systems,  such  as  the  conjugate  gradient  (CG)  method. 

This  proposal  describes  a  method  developed  in  [KaSa86j,  funded  by  AFOSR,  for  the  solution  of  non¬ 
symmetric  block  tridiagonai  systems,  presents  test  results  and  outlines  future  research  directions  for 
developing  a  robust  parallel  iterative  solver  for  nonsymmetric  linear  systems.  The  method,  block-SSOR 
with  CG  acceleration,  has  two  important  properties.  Firstly,  it  is  robust.  Convergence  is  assured  even  for 
poorly  conditioned  systems  with  arbitrary  eigenvalue  distributions.  Secondly,  the  method  can  be  imple¬ 
mented  with  good  parallelism,  making  it  suitable  for  multiprocessor  machines. 

Many  current  iterative  solvers  for  nonsymmetric  linear  systems  do  not  possess  the  first  of  these  desir¬ 
able  properties.  Chebyshev  and  related  least-squares  polynomial  methods  can  fail  if  the  origin  is  con¬ 
tained  in  the  convex  hull  of  the  eigenvalues;  in  addition  some  estimate  of  the  extremal  eigenvalues  is 
required.  Generalisations  of  the  conjugate  gradient  method  such  as  GCR(k)  and  ORTHOMIN(k)  require 
that  the  matrix  has  a  positive  definite  symmetric  part.  GMRES(k)  does  not  place  any  restriction  on  the 
matrix,  but  can  require  retaining  a  large  number  k  of  search  directions  and  hence  large  amounts  of  addi¬ 
tional  storage  in  order  to  achieve  convergence.  Finally,  iterative  solvers  applied  to  the  normal  equations 
can  cause  a  loss  of  information  as  well  as  a  squaring  of  the  condition  number,  a  potential  source  of  numeri¬ 
cal  instability. 

Section  2  of  this  proposal  outlines  the  projection  method  under  consideration.  Section  3  discusses 
some  implementation  variations  that  make  it  suitable  for  a  multiprocessor.  Section  4  presents  numerical 
results  from  tests  run  on  an  Affiant  FX/8  minisupercomputer,  and  section  5  summarises  conclusions  about 
the  algorithm,  and  suggests  extensions  for  the  general  sparse  case  and  other  applications. 


B. 

as 


Description  of  the  Algorithm 

Given  the  nonsingular  system  Ax  =  f  with  AeRNxN  and  f  eRN,  let  the  rows  of  A  be  partitioned 


A 


B,T' 

B2t 


(2.1) 


where  BiCRNxM  and  B2eR^X^ Also,  let  f  be  partitioned  conformally  with  A  as  f  =  (f^  ,f  J)^. 
Then  the  block-row  symmetric  successive  over-relaxation  (SSOR)  method  [BjE179]  with  respect  to  this 
partitioning  is  defined  as 


Algorithm  1  (Block-row  SSOR) 

(a)  Choose  Xq  and  (0,2).  Set  k  =  0. 

(b)  Let 

*1  -  xk  +  wB,  (BjT  Bj)-1  (f, -Bfxd 


z2  =  zl  +  oB2  (B2T  B2)_1  (f2  — B^  Zj) 
z3  =  z2  4-  uiB2  {B2  B2)-1  (f2  —  B^  z2) 

Xfc+i  =  z3  +  wB1(B1TB1)~1(fl-B1Tz3) 

(c)  If  a  stopping  condition  is  not  met,  set  k  —  k  +  land  go  to  (b) 

In  JKaSa86]  it  was  shown  that  the  optimal  relaxation  parameter  OJ  is  1;  in  this  case  the  matrices 
t^Bj  (Bj  Bj)_1  Bj  become  orthogonal  projectors  and  the  algorithm  can  be  simplified: 

Algorithm  2  (Optimal  block-row  SSOR) 

(a)  Choose  Xq  and  set  k  =  0. 

(b)  Set  Xk+1  =  Qxk  +  f. 

(c)  If  a  stopping  condition  is  not  met  set  k  =  k  + 1  and  go  to  (b). 

Here  the  iteration  matrix  is  defined  as 

Q-fl-P.XI-PJfl-P,), 

where 

I-P1=I-Bi(BiTBl)-‘B,T 

=  projector  onto  nullspace  of  Bj 

and  the  modified  right-hand  side  vector  is 

f  -  [(I+(I-P1)(I-P2))(Bl+)T  ,  (I-P1)(B2+)t]  f, 

with 

Bi+  :=  (BiTBi)-1BiT 

Because  Q  is  a  product  of  orthogonal  projectors  it  is  easy  to  show  that  Q  is  symmetric  and  that  the 
spectral  radius  p( Q)  is  bounded  above  by  1.  Furthermore,  since  A  is  nonsingular  p(Q)  <  1.  Unfor¬ 
tunately  p{ Q)  depends  on  COS2#,  where  6  is  the  smallest  angle  between  the  two  subspaces  Sj  =  null(Bj  ) 
and  S2  =  null(B^)  (see  [BjGo73]  for  a  definition  of  6  ).  This  angle  can  be  small,  with  a  correspondingly 
slow  rate  of  convergence  for  Algorithm  2. 

Therefore  some  acceleration  technique  is  desirable.  Since  p(Q)  <  1  implies  that  (I  —  Q)  is  positive 
definite,  the  conjugate  gradient  method  is  a  natural  choice.  Using  a  version  of  CG  from  [Reid71]  gives  the 
algorithm  presented  in  [KaSa86]  and  that  this  paper  further  examines: 

Algorithm  S  (Block  SSOR  with  CG  acceleration) 

(a)  Initialise  x0  =  0,  k  =  0,  p0  =  r0  =  f,  and  p0  —  rj  r0. 

(b)  Set 

wk  =  (I-Q)pk 

Qrk  =  Pic/Pk'w), 

^ic+l  —  Xk  -f  Q^kPk 

rk+l  =  rk  -  »kwk 
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*  ^v. 


—  T 

Pk+l  —  rk+l  rk+l 

Ac+1  ~  Ac+1  /  Pk 
Pk+1  =  rk-t-l+AcPk 

(c)  If  a  stopping  condition  is  not  met,  set  k  =  k  + 1-  and  go  to  (b). 

This  last  method  will  be  called  BSSOR  in  the  remaining  sections,  with  only  one  modification.  If  Xq 
is  chosen  to  be  the  vector  f,  then  P  jTg  =  P  jpo  =  0.  In  this  case 

wo^ii-a-p^i-p^i-p^jpo 
=  Po-(I-Pi)(I-P2)Po 
=  (I-Pl)p0-(I-PlXl-P2)Po 
=S(I-P1)P2P0 

also  satisfies  P  jWq  =  0.  Since  r^+1  is  a  linear  combination  of  and  and  is  a  linear  combina¬ 
tion  of  pjj  and  rjt+1,  an  induction  argument  shows  that  PjT^  =  Pjpjj.  =  0  for  all  k.  As  above,  wk  can 
then  be  computed  using  only  two  rather  than  three  projections  per  iteration.  This  reduces  execution  times 
by  20%  -  25%.  This  initialisation  also  allows  the  updating  of  the  true  residual  on  each  step,  avoiding  the 
need  for  an  additional  multiplication  by  A.  •  The  true  residual  can  be  updated  by  subtracting  a^Ap^  on 
each  step.  Since  PjPfc  =  0,  Bj  p^  =  0  and  those  components  of  the  true  residual  remain  0.  Since 
Bj  Pfc  is  computed  as  the  first  step  in  finding  W^,  the  corresponding  components  of  the  true  residual  can 
be  updated  by  a  saxpy  operation.  This  leads  to  a  5%  reduction  in  the  overall  execution  time. 

C.  Implementations 
C.l.  Partitioning  Choices 

On  each  iteration  BSSOR  requires  the  formation  of  the  product 

w=(I-P1)P2 

and  hence  the  solution  of  two  Unear  least  squares  problems.  This  seems  to  be  an  impractical  way  to  solve 
linear  systems  because  when  A  is  dense  solving  a  single  least  squares  problem  is  generally  more  difficult 
than  solving  the  original  system.  However,  when  A  is  block  tridiagonal,  that  is,  when  A  =  [EpT^Dj] 
where  each  Ej,  Tj,  and  Dj  is  an  nXn  matrix,  it  is  possible  to  select  partitionings  (2.1)  that  allow  each  pro¬ 
jection  Pj  to  be  computed  as  a  simultaneous  set  of  smaUer  least  squares  subproblems.  These  independent 
subproblems  can  then  be  solved  in  parallel. 

To  iUustrate  this  idea,  consider  the  case  when  A  has  24  block  rows,  and  label  them  as  1  through  24. 
Then  by  putting  block  rows  1,  2,  5,  8,  9,  10,  13,  14,  17,  18,  21,  22  into  B^  and  the  others  into  B^T  both 
will  have  8  completely  separate  larger  blocks.  This  will  be  denoted  by 

B?  :  (1,2), (5, 6), (9, 10), (13, 14), (17, 18), (21, 22) 

B2T  :  (3, 4), (7, 8), (11, 12), (15, 16), (19, 20), (23, 24), 

where  the  parenthesis  indicate  the  separation  between  the  larger  blocks. 

This  is  not  the  only  partition  that  gives  disjoint  blocks  for  the  two  projections.  The  number  of  rows 
put  into  each  larger  block  can  be  varied,  and  it  is  not  necessary  for  B|T  and  to  have  the  same  siie  of 
blocks.  Furthermore  some  rows  can  be  allowed  to  appear  more  than  once,  e.g,,  row  3  can  be  put  into  one 
larger  block  of  Bf  and  one  of  Bj. 
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Although  tests  are  restricted  to  block  tridiagonal  systems  in  this  paper,  the  only  property  that  is 
important  here  is  that  the  system  can  be  row  permuted  to  have  separate  blocks  in  each  B*.  In  particular, 
unstructured  sparse  systems  generally  have  this  property.  Block  tridiagonal  systems  are  used  here  because 
they  allow  an  a  priori  determination  of  suitable  partitionings. 

C. 2.  Solving  the  least  squares  problems 

While  all  of  the  above  versions  allow  parallel  computations,  it  is  still  necessary  to  simultaneously 
solve  least  square  subproblems  of  the  form 

min  ||  w  -  Cs|| ,  (3.1) 

where  C  has  block  dimensions  4X2  for  the  above  partition,  and  w  consists  of  the  components  of  f 
corresponding  to  C's  columns. 

Several  algorithms  exist  for  solving  problem  (3.1).  Orthogonal  factorisation  techniques  [GoVL83] 
have  the  advantage  of  being  numerically  stable,  but  require  a  relatively  large  amount  of  storage  for  the 
factors.  Paige  and  Saunder’s  algorithm  LSQR  [PaSa82]  is  an  iterative  method  which  solves  the  augmented 
system 

1  CHr 

CT  0  * 

it  proved  to  be  rather  time  consuming. 

Another  possible  approach  is  to  solve  the  normal  equations 

CTCs  =  CTw.  (3.3) 

While  it  can  be  argued  that  this  is  no  improvement  over  solving  the  normal  equations  for  the  original  sys¬ 
tem  Ax  =  f,  test  results  of  section  4  show  that  in  practice  the  subproblems  (3.3)  remain  well-conditioned 
even  when  the  original  system  is  ill-conditioned.  Combined  with  the  gain  due  to  parallelism,  this  makes 
(3.3)  a  reasonable  approach.  Furthermore  CTC  is  symmetric  positive  definite  and  so  more  efficient  solvers 
are  available. 

The  discussion  above  is  not  meant  to  be  exhaustive  but  rather  to  reveal  the  issues  involved.  The 
solution  method  chosen  for  (3.1)  or  (3.3)  will  depend  on  the  structure  of  C  which  in  turn  will  depend  on 
the  source  of  the  matrix  A.  An  inexpensive  iterative  method  with  an  early  termination  criterion  could  be 
used  during  the  first  few  iterations  of  BSSOR,  with  a  direct  method  subsequently  used  when  the  solution  is 
near  and  high  accuracy  of  the  inner  iterations  is  needed.  This  flexibility  of  the  algorithm  allows  it  to  be 
custom-tailored  to  the  problem  being  solved. 

D.  Numerical  Results 


D.l.  Test  Problems 

BSSOR  is  tested  with  block  tridiagonal  systems  created  by  using  5-point  central  difference  operators 
corresponding  to  elliptic  partial  differential  equations  with  Dirichlet  boundary  conditions  on  the  unit 
square.  The  equations  are  the  same  set  used  in  [Kama86],  where  a  uniform  grid  of  sise  h  =  l/(n+l)  is  used 
for  both  the  x  and  y  coordinates  with  n  ranging  between  8  and  64.  Note  that  in  this  case  n  is  both  the 
block  sise  and  the  number  of  blocks,  so  that  A  is  of  order  n2.  The  right-hand  side  function  g  is  selected  to 
give  a  predetermined  solution  so  that  both  the  residual  norm  and  error  norm  can  be  checked.  The  test 
problems  and  their  solutions  follow: 


Problem  It 


-  =  8 
u  =  x  +  y 


Problem  2t 

-u„  -  [(l  +  xy)uy],  -  0  [cos(x)  u„  +  (e-x  +  x)  uy]  +  3u  =  g 
0  =  5,  250,  10000 
u  =  x  +  y 


Problem  3s 

-(e-^u,),  -  (e*7^),  +  0(x  +  y)u7  +  [^(x  +  yju],  +  (l+x  +  y)-*u  =  g 

0  =  5,  250,  10000 
u  =  xeXTsin(srx)sin(5ry) 


Problem  4s 


- ««  -  ~  lOOxu,  +  yu7  -  l0oix±ylu  =  g 


u  =  x  +  y 


Problem  5i 


-  u«  —  -  xux  +  200yu7  -  300u  =  g 

u  =  x  +  y 


Problem  6s 

-u„  -  Ujy  +  lOOOe*7 ( ux  -  u7)  =  g 
u  =  x  +  y 

In  the  testing  2a,  2b,  2c  and  3a,  3b,  3c  are  used  to  differentiate  between  the  values  0  =  5,  250,  1000  , 
respectively. 

These  problems  are  chosen  to  represent  a  variety  of  possible  eigenvalue  distributions  for  the  matrix 
A.  In  particular,  problems  4-6  have  eigenvalues  on  both  sides  of  the  imaginary  axis  and  symmetric  parts 
that  have  both  positive  and  negative  eigenvalues  [KaSa86],  features  that  make  these  problems  especially 
challenging.  Additionally,  problem  4  has  a  condition  number  that  grows  rapidly  with  n;  for  n  =  32  the 
condition  number  is  already  6.0  X  10lJ. 
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D.2.  Stopping  Conditions 


Each  experiment  is  terminated  whenever  the  initial  residual  of  the  original  system  is  reduced  by  10-®,  that 
is,  whenever 

llrkll  <  10~®||r0|| ,  (4.1) 

where  rk  =  f  —  Axk.  The  only  exception  to  this  is  when  GMRES(k)  is  used  with  a  preconditioner,  in  which 
case  the  residual  of  the  preconditioned  system  is  used.  The  outer  CG  iteration  is  started  with  an  initial 
estimate  of  x  =  0  so  that  the  initial  residual  is  simply  the  right  hand  side  vector.  An  additional  stopping 
condition  is  provided  in  case  too  many  iterations  of  the  CG  algorithm  are  executed.  This  condition,  how¬ 
ever,  is  not  met  in  any  of  the  test  runs. 


D.3.  Use  of  Parallelism 

All  tests  are  performed  on  an  Alliant  FX/8  at  the  Center  for  Supercomputing  Research  and  Develop¬ 
ment  of  the  University  of  Illinois.  The  FX/8  is  a  shared  memory  multiprocessor  with  8  computational  ele¬ 
ments  (CEs).  Each  CE  has  vector  processing  ability,  with  8  vector  registers  containing  32  64-bit  words 
each. 

The  speedup  sk  realised  by  k  CEs  is  defined  as 

sk  :=  /  *k » 

where  tj  is  the  time  required  for  execution  on  j  processors.  In  order  to  keep  the  other  programming  vari¬ 
ables  constant,  for  all  runs  Partition  1  is  used  and  the  least  squares  subproblems  are  solved  by  using  a 
Cholesky  factorisation  of  the  normal  equations.  Vectorisation  is  used  for  both  the  1  CE  and  the  8  CE 
runs. 

Concurrency  can  occur  in  two  basic  ways  in  BSSOR.  First,  when  computing  a  projection  y  =  P(x 
each  least  squares  subproblem  can  be  solved  simultaneously.  For  the  partitioning  used,  this  yields  a  possi¬ 
ble  n/4  parallel  tasks.  Vectorisation  can  then  be  used  within  each  least  squares  subproblem  so  that  the 
overall  projection  is  calculated  in  concurrent-outer/vector-inner  mode.  Second,  the  basic  vector  opera¬ 
tions  such  as  inner  products  and  (vector  4-  scalar*vector)  required  in  the  CG  method  can  be  spread  among 
the  available  processors,  i.e.,  executed  in  vector-concurrent  mode.  The  latter  is  possible  in  any  linear 
solver  that  involves  long  vectors,  while  the  former  is  unique  to  BSSOR. 

Speedups  realised  on  the  8  CE’s  of  the  Alliant  FX/8  range  from  5  to  6  for  problems  tested. 


D.4.  Solving  the  Least  Squares  Subproblems 

Before  discussing  the  method  of  choice,  some  observations  regarding  the  structure  of  C  and  CTC  are 
in  order.  For  the  above  test  problems,  with  the  proposed  partition  (two  block  row  partitioning)  C  is  of  the 
form, 


D  0 
T  D 
D  T  ’ 
0  D 


and  CTC  = 


P  T 
Tt  P 


In  the  above  forms  D,  T,  and  P  represent  a  diagonal,  tridiagonal,  and  pentadiagonal  matrix,  respectively, 
where  in  general  each  one  may  be  different. 
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Prom  the  minimax  characterization  of  the  singular  values  of  C  and  A  [GoVL83j  it  is  easy  to  show 
that  the  condition  number  x(C)  of  C  can  never  be  worse  than  that  of  A.  Table  4.4  shows  the  maximum  of 
k(C)  over  all  the  subproblems  involved  in  problem  4,  along  with  the  estimates  for  x(A)  given  in  [KaSa86], 
The  condition  of  the  subproblems  is  generally  much  better  than  the  overall  condition  of  A,  and  this  holds 
for  each  of  the  5  test  problems.  A  possible  explanation  is  that  the  ill-conditioning  of  A  appears  not  in  the 
subproblems  but  instead  in  the  angles  between  the  range  spaces  of  Pt  and  P2. 


Table  1:  maxttfCl  vs.  k(A1 

n 

Condition 

dumber 

Max  of  /c(C) 

k[A) 

8 

18 

32 

64 

0.2  X101 

0.2  XlO2 

0.1  XlO2 

0.1  XlO2 

0.8  XlO2 
0.2  XlO* 

0.6  X1011 

* 

*  Not  calculated 

Another  issue  in  choosing  a  solver  for  the  least  squares  subproblems  is  that  a  stopping  criterion  must 
be  set  for  the  iterative  methods.  Let  "inner"  refer  to  the  subproblem  computations  and  "outer"  refer  to  the 
computations  specified  by  BSSOR.  It  is  reasonable  to  require  the  inner  accuracy  to  be  at  least  as  good  as 
the  outer  accuracy.  If  it  is  much  more  accurate,  then  the  projectors  Pj  are  well  defined  and  the  number  of 
outer  iterations  needed  is  the  same  as  that  required  by  the  direct  methods,  but  more  work  must  be  done  on 
each  inner  iteration.  Conversely  if  the  inner  accuracy  required  approaches  that  of  the  outer  accuracy,  less 
work  is  done  on  each  inner  iteration  but  now  the  projectors  are  defined  with  random  errors  on  the  order  of 
that  desired  in  the  solution;  more  outer  iterations  will  be  performed  and  the  algorithm  may  even  fail  to 
converge.  Because  one  of  the  motivations  for  BSSOR  is  to  achieve  the  robustness  that  other  methods  lack, 
all  of  the  experiments  use  an  inner  tolerance  for  the  residual  norm  that  is  10  times  more  stringent  than  the 
outer  tolerance. 

Results  comparing  the  solvers  for  problems  4-6  show  that  using  preconditioned  C.G.  schemes  for 
solving  the  least  squares  subproblems  (normal  equations)  require  the  least  time  and  storage.  These 
methods  have  interesting  behavior;  as  the  outer  iterations  proceed  the  inner  CG  method  requires  fewer  and 
fewer  iterations. 


D.6.  Comparison  with  Other  Methods 

Four  algorithms  are  compared  to  the  BSSOR  method.  A  version  of  PCGPAK  supplies  three:  Gen¬ 
eralised  Conjugate  Residuals  (GCR(k)),  Generalized  Minimum  Residual  (GMRES(k)),  and  ORTHOMIN(k) 
[Elma82,SaScS6,EiES83j.  The  fourth  is  the  conjugate  gradient  method  used  on  the  normal  equations  of  A; 
this  should  not  be  confused  with  the  use  of  conjugate  gradients  to  solve  the  least  square-  subproblems  in 
BSSOR.  Preconditioners  used  in  PCGPAK  are  ILU(O),  MILU(O),  and  SSOR(l);  while  CG  is  tested  with 
and  without  IC(0)  preconditioning. 

The  stopping  criterion  adopted  halts  iterations  when  the  relative  residual  norm  is  reduced  by  a  fac¬ 
tor  of  10~*.  One  important  difference  is  that  GMRES(k)  must  use  the  residual  of  the  preconditioned  sys¬ 
tem  for  the  stopping  criterion.  Since  the  initial  preconditioned  residual  can  be  many  orders  of  magnitude 
larger  than  the  unpreconditioned  residual,  this  sometimes  causes  GMRES(k)  to  terminate  prematurely. 
Another  stopping  criterion  is  necessary  to  guard  against  stalling,  and  in  the  experiments  the  iterations  are 
terminated  if  the  residual  norm  is  not  reduced  by  10~*  within  10  iterations. 

The  version  of  PCGPAK  used  is  optimized  for  the  Alliant  FX/8,  and  exploits  both  the  vectorization 
and  concurrent  features  of  that  machine  [AnSa88], 
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CG  is  also  used  with  the  incomplete  Cholesky  factorisation  preconditioner  IC(0).  Since  ATA  is  not 
necessarily  an  M-matrix,  there  is  no  assurance  that  the  factorisation  can  be  carried  out  to  completion.  A 
parameter  a  may  be  added  to  the  diagonal  elements  during  the  factorisation  to  allow  it  to  proceed,  but  as 
with  the  PCGPAK  methods  this  is  set  to  0  for  the  tests.  It  is  important  to  note  that  in  general  such 
parameters  are  difficult  to  select  adaptively,  and  in  part  the  intent  of  these  experiments  is  to  use  the 
methods  as  black-box  solvers. 


D.5.1.  Problems  Compared 

Only  a  subset  of  problems  1-8  are  presented  here.  Problem  3c  has  a  positive  definite  symmetric  part 
and  so  should  be  suitable  for  ORTHOMIN(k)  and  GCR(k).  Problems  4-6  have  indefinite  symmetric  parts; 
problem  4  additionally  is  ill-conditioned  making  it  difficult  for  CG  to  solve.  Problem  site  of  n  =  84  is 
used,  so  that  full  parallelism  can  be  achieved  by  BSSOR. 


D.5.2.  Results 

An  upper  limit  of  2000  iterations  is  imposed  on  each  run,  and  experiments  that  exceed  that  bound 
are  regarded  as  failing  and  marked  with  an  error  code  of  MX.  Three  other  conditions  are  labeled  failures. 
If  more  than  10  iterations  occur  without  a  reduction  in  the  residual  the  run  is  marked  RS  for  "Residual 
Stalls".  If  the  preconditioner  cannot  be  formed,  an  error  code  of  UP  for  "Unstable  Preconditioner''  is  used. 
Finally,  because  GMRES(k)  uses  the  residual  for  the  preconditioned  system  as  a  stopping  criterion,  it  can 
terminate  iterations  while  the  true  residual  is  still  large.  If  it  exceeds  1,  the  run  is  labeled  LR  for  "Large 
Residual". 

Fig.  2  shows  which  failures  occur  for  the  various  methods.  Clearly  BSSOR  has  a  robustness 
unmatched  by  the  other  solvers. 

Figure  3  compares  the  efficiency  of  BSSOR  with  the  other  unpreconditioned  methods.  The  closest 
one  in  robustness  is  CG  on  the  normal  equations.  The  preconditioned  methods,  however,  can  out  perform 
BSSOR  when  they  work. 

This  comparison  of  BSSOR  with  preconditioned  methods  is  unfair  in  one  way.  Only  a  default  value 
of  the  parameter  a  defined  in  section  4.6. 1.1  is  used.  In  practice  the  user  may  be  able  to  find  values  that 
prevent  the  factorisation  failures  from  occurring.  This  is  not  done  here  because  the  intent  of  the  testing  is 
to  measure  how  the  methods  compare  as  "black-box"  solvers.  Given  some  knowledge  of  the  systems  being 
solved,  as  is  the  case  when  the  same  system  is  solved  witl.  several  different  right-hand  sides,  adaptively 
finding  and  using  an  optimal  acceleration  parameter  may  be  more  practical.  Tables  A.l  -  A.4  summarize 
the  results  for  problems  3C-8  with  n  =  64. 

Another  possible  objection  to  the  comparisons  with  the  CG-like  methods  is  that  the  value  of  k  may 
be  too  small.  To  see  if  increasing  k  improves  robustness,  GMRES(k)  is  tested  with  k  =  20  on  problems  4  - 
6  with  n  =  64..  The  results  are  shown  in  Table  A. 5. 

In  summary,  three  results  of  the  testing  stand  out.  BSSOR  is  more  reliable  than  the  other  solvers,  is 
faster  than  the  unpreconditioned  methods,  and  is  generally  slower  than  the  preconditioned  methods  when 
they  succeed. 


E.  Conclusions  and  future  research 

The  BSSOR  method  for  solving  linear  systems  is  a  remarkably  robust  iterative  scheme  that 
transforms  a  nonsymmetric  system  with  an  arbitrary  eigenvalue  distribution  into  a  symmetric  one  with 
eigenvalues  restricted  to  (0,1).  When  applied  to  block  tridiagonal  systems  concurrent  operations  on 
separate  blocks  becomes  possible.  Of  the  many  block  row  partitionings  that  yield  parallelism,  the  ones 
that  allow  the  most  concurrent  tasks  perform  the  best. 
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Method 


BSSOR 


GCR(3 


Figure  2:  Failures  of  Methods 


Problem 

n  =  32  I 


3c  4  5  8  3c 


GCR/31-MILU 


GCR13VSSOR 


GMRES13 


GMRES(3V-ILU 


GMRES13V-MILU 


GMRES(3)-SSOR 


ORTHOMIN13 


ORTHOMINf3VILU 


ORTHOMIN13V-MILU 


ORTHOMIN13V-SSOR 


CG 


CG-ICfO 


Failure  codes: 


RS  I  RS 


RS 


UP  MX 


RS  RS 


RS 


UP  I  RS 


RS  RS 


RS  RS  RS 


n  =  64 


5 


RS 

RS 

RS 

RS 

RS 

RS 

RS 

RS 

RS 

RS 

UP  I  RS  RS 


RS 


UP  I  RS  I  RS  I  UP  I  UP  I  RS  RS 


MX 


UP  =  Unstable  Preconditioner 
RS  —  Residual  Stagnates 
MX  =  Exceeds  Maximum  Iterations 
LR  =  Large  Residual 


Problem 


Figure  3:  Times  for  Unpreconditioned  Methods  (sec 


Method 


BSSOR 


0.94 


6.66 


2.43 


2.56 


5.99 


79.31 


28.42 


17.33 


12.24 


*  Method  Failed 


For  the  problems  examined  here,  each  of  the  generated  subproblems  in  turn  is  highly  structured, 
making  them  amenable  to  being  solved  on  a  multiprocessor.  This  makes  the  algorithm  particularly  suited 
for  a  hierarchal  memory  machine  such  as  Cedar.  This  multiprocessor  is  implemented  with  a  global 
memory  which  supports  up  to  8  clusters;  each  cluster  currently  is  an  Alliant  FX/8  with  multiprocessing 
and  vector  capabilities  and  its  own  memory.  With  this  architecture  the  separate  blocks  created  by  BSSOR 
can  be  assigned  to  individual  clusters.  Solving  each  least  squares  subproblem  can  then  be  done  entirely 


-9- 


within  a  cluster.  Global  memory  can  hold  the  vectors  defined  by  the  outer  CG  iteration,  and  data  com¬ 
munication  between  clusters  can  be  exclusively  through  those  vectors.  Implementation  of  BSSOR  on 
Cedar  would  allow  the  treatment  of  partial  differential  equations  in  three  dimensions,  where  each  least 
squares  subproblem  is  large  scale,  sparse,  and  structured,  and  should  be  solved  via  an  iterative  scheme. 
This  is  an  important  future  research  issue  for  which  this  activity  is  in  need  of  a  more  powerful  multipro¬ 
cessor,  e.g.  an  FX/80  which  has  more  powerful  computational  elements,  larger  cache,  and  larger  memory. 

BSSOR  compares  favorably  with  CG-like  iterative  methods  for  nonsymmetric  systems  when  they  are 
used  without  parameter  estimation.  The  storage  requirements  are  minimal  and  reliability  is  high.  Usually 
BSSOR  is  more  efficient  than  the  unpreconditioned  methods,  and  is  slower  only  if  the  CG-like  method  is 
successfully  used  with  a  preconditioner. 

Although  testing  has  been  restricted  to  block  tridiagonal  systems  in  this  paper,  BSSOR  can  be 
applied  to  other  problems.  The  crucial  property  of  block  tridiagonal  systems  is  that  they  can  be  easily 
row-permuted  to  yield  separable  systems  in  BjT.  Any  sparse  system  of  equations  that  can  be  reordered 
into  a  banded  system  can  also  be  solved  using  this  method,  only  now  the  blocks  have  variable  size.  Nar¬ 
rower  bandwidths  will  give  greater  parallelism  and  hence  efficiency,  but  the  method  can  still  be  applied 
even  in  the  dense  or  sparse  unstructured  case  [Kacz37].  Furthermore,  the  system  need  not  be  linear.  These 
last  two  points,  unstructured  sparse  linear  problems  and  handling  sparse  nonlinear  systems,  are  important 
research  issues  for  further  extension  of  BSSOR.  Another  potential  extension  of  the  method  is  to  sparse 
least  squares  problems.  The  adjustment  of  geodetic  networks,  for  example,  leads  to  least  squares  problems 
where  the  coefficient  matrix  has  block  angular  form  [G0PS88].  This  form  requires  no  further  permutation 
to  be  suitable  for  block  row  projection  methods. 


Table  A.lt  Comparison  of  Methods  for  Problem  3c.  n  =  64 


■ES3S5BH 

Precond 

Iters 

Time 

Residual 

Failure  Code 

Block  SSOR 

116 

0.599464E+01 

0.575479E-03 

GCR(3) 

1947 

0.503018E+02 

0.474242E-03 

ILU 

7 

0.209307E+01 

0.328218E-03 

MILU 

5 

0.183364E+01 

0.164215E-03 

SSOR 

UP 

GMRES(3) 

- 

1947 

0.492165E+02 

0.474242E-03 

ILU 

7 

0.189987E+01 

MILU 

5 

0.151040E+01 

0.16421SE-03 

SSOR 

UP 

ORTHOMIN(3) 

1639 

0.493057E+02 

0.474713E-03 

ILU 

7 

0.209663E+01 

0.305439E-03 

MILU 

5 

0.163831E+01 

0.151166E-03 

SSOR 

UP 

CG 

- 

616 

0.143547E+02 

0.471824E-03 

m 

7 

0.944420E+00 

Failure  codes: 


UP  =  Unstable  Preconditioner 
RS  =  Residual  Stagnates 
MX  =  Exceeds  Maximum  Iterations 
LR  =  Large  Residual _ 


Table  A.2:  Comparison  of  Methods  for  Problem  4.  n  =■  64 


Method 

Precond 

Iters 

Time 

Residual 

EZSHsffEI 

Block  SSOR 

_ 

300 

0.479802E-04 

GCR(3) 

— 

629 

0.161787E+02 

0.150230E+02 

RS 

46 

0.994555E+01 

0.315676E+02 

RS 

MILU 

.1 

0.851480E+00 

0.181254E-06 

SSOR 

647 

0.135435E+03 

0.156203E+02 

RS 

GMRES{3) 

629 

0.162156E+02 

0.150230E+02 

RS 

ILU 

38 

0.745205E+01 

0.315676E+02 

RS 

1 

0.801980E+00 

0.181254E-06 

SSOR 

28 

0.513346E+01 

0.156204E+02 

RS 

ORTHOMIN(3) 

17 

0.528290E+00 

0.176171E+02 

RS 

ILU 

224 

0.468544E+02 

0.325586E+02 

RS 

MILU 

1 

0.847800E+00 

0.181254E-06 

SSOR 

126 

0.272351E+02 

0.166143E+02 

RS 

CG 

_ 

2000 

0.429486E+02 

0.944334E-03 

MX 

H3H 

UP 

Failure  codes: 


UP  —  Unstable  Preconditioner 
RS  =  Residual  Stagnates 
MX  =  Exceeds  Maximum  Iterations 
LR  =  Large  Residual _ 


Table  A.3:  Comparison  of  Methods  for  Problem  5.  n  =  64 


Method  I  Precon  d 


Block  SSOR 


GCR(3) 


GMRES(3) 


ORTHOMIN(3) 


ILU 


MILU 


SSOR 


ILU 


MILU 


SSOR 


ILU 


MILU 


SSOR 


Failure  codes: 


78 


1884 


398 


UP  = 
RS  = 
MX  = 
LR  = 


0.344119E+01 


0.238077E+02 


0.893210E+00 


inm 


0.347982E+01 


0.131598E+02 


0.892620E+00 


0.224104E+02 


0.196827E+01 


0.203837E+02 


0.889790E+00 


0.166553E+02 


0.430478E+02 


0.352268E+02 


Unstable  Preconditioner 
Residual  Stagnates 
:  Exceeds  Maximum  Iterations 
Large  Residual 


Residual 


0.975404E-05 


0.605459E+00 


0.477515E+00 


0.561537E-12 


0.433020E+00 


0.805459E+00 


0.477517E+00 


0.580116E-12 


0.433020E+00 


.813196E+00 


0.528433E+00 


0.561537E-12 


0.479316E+00 


.148442E-04 


.150832E-04 


Failure  Code 


Method 


Block  SSOR 


GCR(3) 


GMRES(3) 


Table  A.4:  Comparison  of  Methods  for  Problem  6.  n  =  64 


Precond 


ILU 


MILU 


SSOR 


Iters 

Time 

Residual 

109 

0.173253E+02 

0.192833E-03 

483 

0.123350E+02 

0.254707E-03 

2000 

0.401148E+03 

0.149972E+08 

BiaBfflBEBEll 


.122430E+02 


0.111076E+01 


0.254707E-03 


0.285255E+03 


ORTHOMIN(3) 


458 

0.134819E+02 

0.240795E-03 

206 

0.428312E+02 

0.149968E+06 

657 

0.151798E+02 

0.256340E-03 

90 

0.855858E+01 

0.252914E-03 

UP  =  Unstable  Preconditioner 

Failure  codes: 

RS  =  Residual  Stagnates 

MX  =  Exceeds  Maximum  Iterations 

LR  =  Large  Residual 

Table  A.5:  GMRESI201  with  n  =  64 


Problem 

Precond 

Iters 

Time 

Residual 

Notes 

4 

— 

77 

0.279077E+01 

0.127500E+02 

mm 

ILU 

52 

0.105483E+02 

0.212383E+02 

RS 

MILU 

1 

0.804430E+00 

0.1812S4E-08 

SSOR 

31 

0.631055E+01 

0.158618E+02 

RS 

5 

_ 

218 

0.787123E+01 

0.497596E+00 

RS 

ILU 

79 

0.158311E+02 

0.730819E-05 

MILU 

1 

0.884910E+  00 

0.560116E-12 

SSOR 

103 

0.208793E+02 

0.905222E-05 

6 

— 

476 

0.172964E+02 

0.249851E-03 

ILU 

2 

0.112231E+01 

0.265255E+03 

LR 

MILU 

— 

_ 

- 

UP 

SSOR 

mm 

UP 

UP  =  Unstable  Preconditioner 

RS  =  Residual  Stagnates 

MAXIT  =  Exceeds  Maximum  Iterations 

LR  =  Large  Residual _ 


Failure  codes: 
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